Skip to content

[FSTORE-1470] Documentation and Guides for Snowflake Schema #398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 4, 2024

Conversation

davitbzh
Copy link
Contributor

No description provided.

@davitbzh davitbzh requested a review from jimdowling July 25, 2024 12:51
Copy link
Contributor

@jimdowling jimdowling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your data model is off.
Check here:
https://docs.google.com/document/d/12slqcD6Fu1DB47CTybAdCBO8NwocZGGK-wovm6mtKdk/edit

merchant_details should be joined on merchant_id
aggregated_cc_trans is a child of cc_transactions.
account_details is a child of aggregated_cc_trans

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the figures, the PKs should always be the first columns in the tables.
They should be called FK (not PK) in the tables that point to the tables that own the PKs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of columns in tables should be:

  1. primary keys
  2. event_time
  3. partition keys
  4. features
  5. foreign keys

.join(location_fg.select_all(), left_on=["location_id"], right_on=["id"], join_type="left")
selected_features = credit_card_transactions_fg.select_all() \
.join(account_details_fg.select_all(), on=["cc_num"]) \
.join(merchant_details_fg.select_all(), left_on=["cc_num"], right_on=["merchant_id"], join_type="left")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why explicitly set join_type=left?
You should document that's the default join type, but no need to include it here, right?
It's very rare you would need to change the join_type, so KISS here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good example:
.join(merchant_details_fg.select_all(), left_on=["cc_num"], right_on=["merchant_id"],

This is, imo, a better example
.join(merchant_details_fg.select_all(), left_on=["merchant_id"], right_on=["id"],

Often the FK has the name "<fg_name>_id", and the FG itself as "id" as its PK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding join_type, we are talking about more complex joins. So added comment about default join type and changed to "inner" just to make the point.


# save the query to feature view
feature_view = fs.create_feature_view(
name='rain_dataset',
query=feature_join
name='credit_card_fraud',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an explicit version to the feature view?
Labels in the feature view?

```
More complex joins are possible by selecting subsets of features from the joined feature groups and by specifying a join key and type.
Possible join types are "inner", "left" or "right". Furthermore, it is possible to specify different features for the join key of the left and right feature group.
The join key lists should contain the names of the features to join on.

=== "Python"
```python
feature_join = rain_fg.select_all() \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, to keep join_type her.e

```

#### Snowflake schema
Hopsworks also provides the possibility to define children and grandchildren of the root (the left most) feature group.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopsworks also provides the possibility to define a feature view that consists of a nested tree of children (to up to a depth of 10) from the root (left most) feature group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a max depth for the tree?

#### Snowflake schema
Hopsworks also provides the possibility to define children and grandchildren of the root (the left most) feature group.
This is called Snowflake Schema data model where you need to build nested tables (subtrees) using joins, and then join the
subtrees to their parents which can be children of the root node (the leftmost feature group):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subtrees to their parents iteratively until you reach the root node (the leftmost feature group in the feature selection)


selected_features = credit_card_transactions.select_all()
.join(nested_selection)
.join(merchant_details.select_all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a closing parentheses "):

.join(merchant_details.select_all()
```

Now, you have the benefit that in online inference you only need to pass two foreign key values to retrieve the precomputed features:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass two serving key values (the foreign keys of the leftmost feature group)

.join(location_fg.select_all(), left_on=["location_id"], right_on=["id"], join_type="left") \
.filter((rain_fg.location_id == 10) | (rain_fg.location_id == 20))
selected_features = credit_card_transactions_fg.select_all() \
.join(account_details_fg.select_all(), on=["cc_num"]) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc_num -> merchant_id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its account_details_fg, cc_num is the foreign key here in credit_card_transactions_fg and id or transaction_id will be pk

@davitbzh davitbzh requested a review from jimdowling July 26, 2024 14:39
@davitbzh davitbzh requested a review from rcnnnghm September 30, 2024 12:55
@davitbzh davitbzh requested a review from rcnnnghm September 30, 2024 13:02
@jimdowling jimdowling merged commit 86856bc into logicalclocks:main Oct 4, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants